A Space Efficient Persistent Implementation of an Index for DNA Sequences
Abstract
Owing to newly developed high-throughput technologies for DNA sequencing, the number of fully sequenced species is increasing rapidly. String databases holding these sequences are very large. In the field of molecular biology, handling large string data that cannot be broken into words is a great challenge. Here the most important string operation is the approximate substring match, which is essential for many applications in biology. The suffix tree has become established as the best-suited in-memory data structure for supporting approximate substring matches on DNA sequences. It belongs to the wider class of suffix structures. The central issue of the paper is the theoretical evaluation of suffix structures, with the aim of identifying the most suitable structure for a persistent implementation on disk. I will show that this is a variant of the suffix tree. Further, the paper addresses how this data structure can be stored on disk and how fast access can be achieved. In this connection I introduce a space-efficient implementation that uses a tree coding scheme to optimize disk usage and thereby prevent unneeded disk accesses. The most important design issue lies in the representation of the different variants of tree nodes. I present an implementation in the C programming language and show that, in a first evaluation step, it compares favourably with other implementations.
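To make the central operation concrete, the following is a minimal sketch of an approximate substring match, under assumptions that are not taken from the paper: mismatches are counted as substitutions only (Hamming distance, no insertions or deletions), and the text is scanned naively rather than through a suffix tree. The function name `find_approx` and its parameters are hypothetical, chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative only: a naive k-mismatch scan, not the index described in
 * the abstract. Returns the offset of the first window of `text` that
 * differs from `pattern` in at most `max_mismatches` positions, or -1 if
 * there is no such window. Only substitutions are counted.
 */
long find_approx(const char *text, const char *pattern, size_t max_mismatches)
{
    assert(text != NULL && pattern != NULL);

    size_t n = strlen(text);
    size_t m = strlen(pattern);
    if (m == 0 || m > n)
        return -1;

    for (size_t i = 0; i + m <= n; i++) {
        size_t mismatches = 0;
        for (size_t j = 0; j < m; j++) {
            /* Stop comparing this window once the budget is exceeded. */
            if (text[i + j] != pattern[j] && ++mismatches > max_mismatches)
                break;
        }
        if (mismatches <= max_mismatches)
            return (long)i;
    }
    return -1;
}
```

Every query here costs time proportional to the text length times the pattern length. An index such as a suffix tree is meant to avoid this full scan for each query, which is what motivates the disk layout question the abstract raises.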
Publication date: 2003